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Physical parameters are often constrained from the data hkehhoods using sampling methods. 
Changing some parameters can be much more computationally expensive ('slow') than changing 
other parameters ('fast parameters'). I describe a method for decorrelating fast and slow parameters 
so that parameter sampling in the full space becomes almost as efhcient as sampling in the slow 
subspace when the covariance is well known and the distributions are simple. This gives a large 
reduction in computational cost when there are many fast parameters. The method can also be 
combined with a fast 'dragging' method proposed by Neal [Ij that can be more robust and efficient 
when parameters cannot be fully decorrelated a priori or have more complicated dependencies. I 
illustrate these methods for the case of cosmological parameter estimation using data likelihoods 
from the Planck satellite observations with dozens of fast nuisance parameters, and demonstrate a 
speedup by a factor of five or more. In more complicated cases, especially where the fast subspace 
is very fast but complex or highly correlated, the fast-slow sampling methods can in principle give 
arbitrarily large performance gains. The new samplers are implemented in the latest version of the 
pubhcly available COSMOMC code. 
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I. INTRODUCTION 

Bayesian methods are often used to compare physical 
models to data, where parameters in different models are 
most easily constrained by sampling from the posterior 
distribution. Sampling methods scale well with the di- 
mension of the parameter space, in the best case scaling 
linearly, compared to brute force integration which scales 
exponentially badly. However as the number of parame- 
ters in the models grow, the computational cost can still 
be challenging. This can happen either because the com- 
plexity of the physical model increases, or because, as the 
precision of the constraints increases, so does the com- 
plexity of the data analysis nuisance parameters. This 
paper describes a way to improve on the scaling with 
dimension in the case where the the additional param- 
eters are 'fast' parameters. This reduces computational 
cost and also makes feasible the analysis of more realis- 
tic models with more parameters than might otherwise 
be easily tractable. The methods are general, but for 
concreteness I will focus on examples in cosmological pa- 
rameter estimation that were the original motivation. 

The separation of fast and slow cosmological parame- 
ters was introduced in Ref. [2] . The idea is that calculat- 
ing a new likelihood when you change different parame- 
ters can have very different numerical cost depending on 
which parameters are changed. For example in cosmol- 
ogy changing a matter density parameter requires a full 
recalculation of the evolution of the background model 
and the perturbations, and then all the data likelihoods 
that depend on them: this is slow. However changing a 
primordial power spectrum parameter does not change 
the linear transfer functions, so only the integrals to cal- 
culate the theoretical power spectra from the transfer 
functions need to be done. If the data likelihoods are 
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fast, changing primordial power spectrum parameters is 
therefore much faster than changing density parameters. 

Furthermore there are often many nuisance parame- 
ters. For example the data likelihood often depends on 
a variety of parameters governing the calibration, noise 
levels and other characteristics of the experiment. In 
cosmology, the likelihood from the Planck satellite obser- 
vations has parameters governing foreground amplitudes 
and correlations, uncertainties in the instrument beam 
response, and calibration uncertainties. When these nui- 
sance parameters are changed, keeping other physical pa- 
rameters fixed, only the dependent likelihood function 
has to be recomputed, not the theoretical predictions 
nor other independent likelihoods for other data being 
used. There can easily be a speed hierarchy of a hun- 
dred or more between changing the nuisance parameters 
and changing the main parameters governing the phys- 
ical model. In some cases the nuisance parameters can 
be quickly numerically or analytically marginalized; how- 
ever, in other cases they cannot, or the nuisance param- 
eter posteriors are themselves of interest, so an efficient 
sampling method is required. 

When there are many more fast parameters than slow 
parameters, there are potentially large efficiency gains 
to be had from exploiting the different computational 
speeds. If all parameters are sampled efficiently, and 
changing fast parameters is much faster than changing 
slow parameters, there is at least a linear saving (by a 
factor of order l//siow, where /siow is the fraction of pa- 
rameters that are slow). However if exploration of the 
fast parameter space is difficult, or there are unmodelled 
degeneracies between fast and slow parameters, the sav- 
ing could be arbitrarily larger than this. 

I will focus in this paper on unimodal distributions that 
are nonpathological for good physical reasons, where sim- 
ple standard Metropolis-Hastings [31 13] sampling meth- 
ods work well. The outline of this paper is as follows: in 
Sec.|ll]l describe a general method for constructing a set 
of decorrelated fast and slow parameters to allow efficient 



movement within the fuU parameter space. This can eas- 
ily be used with standard Metropohs samphng, and also 
optionally with oversampling in the fast-parameter sub- 
space for significant performance gains in some cases. In 



Sec. Ill I describe an implementation of a fast-parameter 
'dragging scheme' introduced by Ref. [1]. This can be 
used in combination with the parameter decorrelation 
and adaptive methods to provide efficiency gains in some 
non-Gaussian problems, or new problems with strong 
fast-slow correlations where the correlation structure is 
not well known a priori. In Sec. |IV| I discuss the merits 
and scaling of the different methods in some simple lim- 
iting cases and a useful convergence measure, and then 
go on to discuss indicative performance in the specific re- 
alistic case of parameter inference from Planck satellite 
observations (for which these methods were originally im- 
plemented). I finish with conclusions, and discuss in the 
Appendix a few more general details of the proposal dis- 
tribution implemented in the public COSMOMC code^. 



II. FAST-SLOW DECORRELATION 

The performance of Metropolis-Hastings Markov- 
chain Monte-Carlo (MCMC) sampling methods depend 
strongly on the shape of the distribution and also on 
the choice of proposal distribution. If there are corre- 
lated parameters a good choice of proposal distribution 
will propose longer steps along the degeneracy directions. 
For simple distributions this is first done by intelligently 
choosing the base parameters [5] to remove tight non- 
linear degeneracies. Then using an estimate of the co- 
variance of the base parameters (e.g. from an initial run, 
or the first steps of an adaptive method) by orthogonal- 
izing the base parameters, so that proposals are made 
to linear combinations of the original base parameters 
that are nearly independent. This allows rapid move- 
ment through the distribution, and hence fewer steps be- 
tween independent samples and faster convergence to the 
desired distribution. 

The simplest method to exploit fast parameters is to 
make separate proposals sequentially is the fast and slow 
parameter subspaces [2]. However a potential problem 
is that fast and slow parameters can be arbitrarily cor- 
related. We want to redefine parameters to be as un- 
correlated as possible, but in such a way that that fast 
and slow parameters are not all mixed together, so that 
changing a subset of the new parameters remains fast. 

Fortunately it is possible to simultaneously decorrelate 
all the parameters, and maintain the same number of fast 
directions in parameter space. The idea is illustrated in 
Fig.[l] To do this we first order the parameters by speed, 
so that iii < j then Xi is slower than (or the same speed 
as) Xj . We can now make a linear parameter redefinition. 






FIG. 1: Possible proposal directions (arrows) for sampling 
a correlated 2D distribution: Left: a choice of orthogonal 
eigenvector directions explores efficiently but requires chang- 
ing both fast and slow parameters in both proposal directions; 
Centre: a choice that allows fast moves in the x-direction but 
is non-orthogonal; Right: By performing a linear shearing 
(parameter redefinition of the slow direction) the proposal 
distribution can be orthogonal and changes in the fast x di- 
rection can remain fast. 



SO that the new slow parameters depend on the original 
fast and slow parameters, but the new fast parameters 
do not depend on slow parameters. To do this Cholesky 
decompose the covariance as 



C = (xx^) = LL^, 



(1) 



where i is a lower triangular matrix. Then the new 
decorrelated parameters are taken to be x' = L~^x. If a 
proposed move in the new orthonormal x' space is x' — > 
x' + Ax', then in the original space 
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(2) 



The lower triangular nature of L ensures that if Ax' only 
has non-zero components with i > j, then parameters in 
X are also only modified where Xi has i > j. There- 
fore, the decorrelated parameters x' are also ordered by 
speed, in that changing the ith decorrelated parameter 
only requires calculating likelihood changes for parame- 
ters which are as fast or faster than^ Xi. 

To see the structure in a simple example, consider a 
parameter vector consisting of two slow parameters Si 
and several fast parameters Fi. The parameter vector 
x-'" = (S'iS'2Fii^2^3 • • ■ ) is then related to the orthonor- 
malized parameters x' by 
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^ http://cosmologist.info/cosmomc/ 



^ Of course this typically will be slower than just changing Xi, 
which is the price paid for decorrelation. If certain blocks of 
parameters are known to be only slightly correlated, their inde- 
pendence could be imposed so that the changing one docs not 
require changing parameters in the other block. 



where stars denote the non-zero elements of the lower 
triangular matrix L from the Cholesky decomposition of 
the covariance. This is of the form 



(4) 



where the Ms and Mp submatrices can be precom- 
puted from an estimate of the covariance. Under a 
proposed move in the slow orthonormalized subspace 
Ax' = (AS', 0) we then have 
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Ax = Ms AS', 



(5) 



so all of the original parameters are changed. However, 
under a move in the fast orthonormalized subspace Ax' — 
(0, AF'), only fast parameters change: AS = and 



AF = MfAF'. 



(6) 



This can be generalized to many blocks of fast and 
slow parameters of similar speed. Standard Metropolis- 
Hastings moves can now be made in the diagonalized 
fast-slow parameter space, and only moves along slow di- 
rections will be slow. This achieves fast movement in the 
slow parameter space, while maintaining efficiency gains 
from exploiting the fast-slow decomposition. 



A. Block randomization 

Proposing sequentially in the x' parameters allows for 
efficient exploration of parameter space, with changes to 
fast parameters remaining fast. However, we would also 
like our sampler to be a bit robust; for example, if the 
assumed covariance is not quite right, or the shape of the 
posterior is non-Gaussian. The choice of proposal distri- 
bution function is discussed in Appendix lAj In particular 
it is usually better to sample in random directions in x' 
space, rather than along the axes. The approach used 
in COSMOMC is to sample cyclically from a randomly 
rotated coordinate basis (with the random rotation pe- 
riodically changed, e.g. once per complete cycle). This 
allows all directions to be eventually proposed, and also 
improves robustness to the covariance in one direction 
being badly misestimated. 

With fast and slow parameters this is not a good idea 
because it would mix fast and slow parameters in each 
proposal. Instead, parameters can be blocked into groups 
of equal or similar speed, with the sampler making pro- 
posals along random directions in these equal-speed sub- 
spaces. This will be slightly less robust than using ran- 
dom directions in the full space, but perfectly preserves 
the fast-slow parameter separation. This is what is im- 
plemented in CoSMOMC along with random ordering of 
proposals between different blocks. 



B. Fast- parameter oversampling 

Given a set of parameter blocks of similar speed, we 
are free to choose the relative number of proposals made 
within each block. If all parameters are equivalent, the 
most efficient symmetric solution is to sample all param- 
eter directions equally (i.e., each block is sampled pro- 
portional to the number of parameters in it). However 
changing the fast parameters is fast, so it may be advan- 
tageous to make more steps in the fast blocks than in the 
slow blocks. I define a parameter /fast so that /fast times 
more fast parameter proposals are made than under the 
equal-sampling scheme; the total number of proposals per 
cycle is then A'siow + /fast-^fast- In general there could be 
many different speed blocks, and each could have its own 
/fast factor, but for simplicity I restrict to just one. Using 
/fast > 1 will in general be beneficial in that it allows bet- 
ter exploration of the fast subspace for small additional 
computational cost. The only cases where it is not likely 
to be beneficial is when it is known that the fast subspace 
is not actually that fast, the fast subspace is much easier 
to explore than the orthogonalized slow parameter space, 
or the fast parameters are nearly independent of the slow 
parameters and the only goal is to have small sampling 
error on functions of the slow parameters. 

Using /fast ^ 1 will allow fast parameters to fully ex- 
plore the conditional distribution for each point in the 
slow parameter space. This can be a large gain if this 
subspace is hard to explore, and in general helps to re- 
duce to sample variance fluctuations. However, it will not 
greatly improve convergence in the full parameter space 
if there are correlations with the slow parameters, since 
the overall sampling efficiency is still limited by the effi- 
ciency with which the slow parameter space is explored. 
The method described in the next section describes an 
alternative method that uses many fast likelihood eval- 
uations to improve the movement in the slow parame- 
ter space, which for distributions with general fast-slow 
dependencies and a large speed hierarchy can be more 
efficient. 



III. DRAGGING FAST VARIABLES 

In Ref. [T] Neal devised a general scheme for sampling 
fast and slow parameters by 'dragging' the fast param- 
eters along each slow Metropolis proposal. The idea is 
that when varying slow parameters ideally we would like 
to sample the fully fast-marginalized probability, so that 
there is no random walk behaviour from exploring corre- 
lations with fast parameters or funny shapes in the full 
parameter space. For Gaussian distributions the decor- 
relation scheme described in Sec. |ll] will of course achieve 
this anyway since the marginalized and conditional dis- 
tributions are the same in a fully orthonormalized pa- 
rameter space. However, in practice, the distributions 
are not Gaussian, and there may be only an approxi- 
mate covariance to do the decorrelation, so any scheme 



that samples from the fast-marginahzed distribution ef- 
ficiently will be more robust. Neal's dragging method 
asymptotically achieves this by a sampling method. 

The method works by making a proposal in the slow 
parameter space, and then running a chain in the fast 
parameter space that is guided to explore any fast-slow 
degeneracy and move towards the region of the fast pa- 
rameter space that has high likelihood at the proposed 
slow parameter point. By suitably sampling from dis- 
tributions that interpolate between those at the two end 
points, the fast parameters can be 'dragged' along any de- 
generacy direction, and the acceptance probability for the 
full slow move approaches that expected from sampling 
directly from the fast-marginalized distribution (which 
you can't actually calculate for large numbers of fast pa- 
rameters), see Fig. [2] 

The method is a variant of the Metropolis-Hastings 
algorithm, where at each point in the chain a move is 
proposed and then accepted depending on a probability 
ratio. Sampling from a distribution P{x, y) depending 
on A'fast fast parameters (direction x) and Ngiow slow 
parameters (direction y) the basic steps for symmetric 
proposal distributions are as follows [1] : 

1. From the current slow parameters y propose a new 
set of slow parameters y' 

2. Construct a series of n — 1 distributions Pi{x) that 
interpolate between P{x\y) and P{x\y') with 

In PAx)= (^-^)^^Pi^\y) + ^^^Pi^\y') (7) 

n 

for i = 0, . . . , 71. With this definition Po{x) = 
P{x\y) at the starting slow position and P„(a;) = 
P{x\y') at the proposed slow position. 

3. Sample a chain of fast parameters x by in turn sam- 
pling from Pi{x) for i = l,...,n— 1, starting at the 
current value of the fast parameters x = xq and 
ending at a final value x' = Xn-i- In practice this 
is done by making one or more Metropolis steps for 
the distributions Pi{x). 

4. Accept the entire move (a;,y) — > {x',y') with prob- 
ability 




FIG. 2: Illustration of sampling from a correlated 2D distribu- 
tion (contours, where the correlation is unknown a priori and 
hence not taken out by a variable decorrelation transforma- 
tion). Here x is a fast direction, y is slow. Consider propos- 
ing a move j/ — >■ y' as shown by the arrow in the top plot. 
Normally this would be immediately rejected with very high 
probability. Lower plot: the dragging method takes samples 
in the fast x direction from a series of interpolating distribu- 
tions [magenta lines] that interpolate between P{x\y) [blue] 
and P{x\y') [red]. This allows the fast samples to gradually 
explore the degeneracy direction, for example ending up at the 
red point in the upper plot. If there are enough interpolating 
steps the total move is accepted with probability similar to 
sampling from the marginalized distribution P{x) [solid black 
line in the lower plot] , which in general is not possible directly. 



n-l 



l,exp -^ [\nP{xi,y) - lnP(a;j, y')] 



(8) 



See Ref. [I] for further details and the proof that the 
method satisfies detailed balance and hence converges to 
the desired distribution. Note that at the last step the 
chain is not sampling from P{x\y'), but the interpolat- 
ing distribution P„_i(x) which is one step away towards 
P{x\y). For tightly correlated variables as shown in Fig. [2] 
the overall acceptance probability will only be high if the 
number of steps is high enough that P„_i(x) is close to 
P{x\y'), and hence a large number of steps are required. 



If the variables are uncorrelated, then n = 1 will work 
well, which is equivalent to a standard Metropolis step 
with no interpolating distributions. It may be advanta- 
geous to make one or more additional standard Metropo- 
lis steps from P{x\y') after each dragging step, but any 
difference in performance will be small for large numbers 
of interpolating distributions. For uncorrelated fast and 
slow parameters a large n allows the fast parameters to 
move to a new random position for each slow move, and 
when there are correlations between fast and slow param- 



eters it allows the fast parameters to move to the region 
of high likelihood at the new slow position. 

Given that the current value of P{x,y) is known, the 
entire dragging step only requires one new slow evalua- 
tion at the new value y' . Whether the method is helpful 
in practice depends on whether all the fast likelihood 
evaluations for P{xi,y) and P{xi,y') are fast enough 
that the additional computational cost is worth it for im- 
proved movement in the slow parameter subspace. Note 
that the method requires 2n fast likelihood evaluations 
per slow evaluation. 

There are free parameters in the method determining 
the number of intermediate distributions to sample from 
n, and the number of intermediate Markov chain steps 
at each intermediate distribution. Typically the num- 
ber of interpolating distributions required will depend on 
the number of fast dimensions, and I define /drag so that 
n — /drag-^fast (sec e.g. [S], though in general the optimal 
number of interpolating steps will depend in a unobvious 
way on the structure of the distribution being sampled). 
Since the method benefits from the interpolating distri- 
butions being as close to each other as possible Cos- 
MOMC does just one Metropolis step per interpolating 
distribution. There is also a choice over which parame- 
ters to drag; for the scheme to be efficient, there needs to 
be good hierarchy in the parameter speeds. COSMOMC 
groups fairly slow parameters together, and treats these 
as 'slow' variables, and groups all the faster parameters 
together and drags them. It may be possible to devise a 
more general hierarchical dragging scheme, but it would 
require a good hierarchy between all the blocks and is not 
well motivated by current applications at this point. If 
there are fast parameters which are known to be nearly 
independent to other parameters these could be sepa- 
rated out and sampled in their own standard Metropolis 
steps to avoid the computational overhead of unnecessar- 
ily dragging them. 



IV. PERFORMANCE COMPARISON 

The performance of different sampling methods can 
vary significantly between problems, and there is no gen- 
eral 'best' method. I start by briefly considering a few 
simple limiting cases where the behaviour of the sam- 
pling methods is easily understood, and then go on to 
discuss indicative convergence statistics for various fast- 
slow sampling methods when applied to sampling for cos- 
mological parameter analysis. 

• Independent fast and slow parameters: The 
speedup from separating the fast and slow propos- 
als here is 0(1 -I- A^fast/^siow)- The advantages to 
oversampling the fast parameters (/fast > 1) in this 
trivial case are only the following: (1) greater sam- 
pling accuracy on the fast parameters; (2) slight re- 
duction the variance of any calculations involving 
fast parameters; and (3) more rapid convergence 



if the fast parameter subspace is harder to sample 
than the slow subspace. The dragging scheme has 
no advantages in this case and an overhead that 
makes it less efficient. 

• Gaussian posterior with correlated fast and slow 
parameters: When the covariance is known accu- 
rately a priori, the Cholesky parameter rotation 
will completely decorrelate the parameters and for 
Gaussian parameters also make them independent. 
This case is then equivalent to the first case. How- 
ever if the covariance is not known a priori, a drag- 
ging scheme may be much more efficient than a 
Metropolis scheme because it allows the slow pa- 
rameter space to still be explored almost as though 
the fast parameters had been marginalized out. In 
practice, an adaptive scheme that learns the covari- 
ance will be even more efficient (such as that im- 
plemented in CoSMOMC; see Sec.|IVB2]below). 



• Modest speed hierarchy: The Cholesky decorrela- 
tion scheme will allow relatively modest speed hi- 
erarchies to be exploited. In this case /fast ~ 1 is 
likely to be optimal unless the faster space is sig- 
nificantly harder to sample. However for an 0(1) 
speed hierarchy the dragging scheme is likely to be 
significantly slower due to the likelihood overhead 
and the imposed asymmetry between the treatment 
of the different parameters. Also, in the limit that 
all parameter speeds are very nearly the same it 
may be better not to use a fast-slow method at all, 
so that random rotations of the proposal explore 
all possible directions. 

Note that if there are two likelihood components 
with identical internal speed but some common nui- 
sance parameters, there is still a speed hierarchy: 
when updating the shared parameters both likeli- 
hoods must be reevaluated, but updating the inde- 
pendent parameters only one likelihood changes, so 
the common shared parameters are twice as slow as 
the independent parameters. 

• Very fast parameters: In the case that the com- 
putational cost of fast steps is always negligible, 
the dragging method will always be more efficient 
since it becomes equivalent to sampling in the fast- 
marginalized slow subspace (except in the special 
case where fast and slow parameters are indepen- 
dent) . 

Note that the correlation lengths of the output chains 
are quite different in the different sampling methods. 
The dragging scheme only generates new samples where 
changes in the slow parameters are proposed, but the 
Metropolis schemes will generate many more samples in- 
cluding steps that only involve fast parameters. For inde- 
pendent fast and slow parameters the dragging method 
output is roughly equivalent to taking the Metropolis 
output and thinning it by a factor 1 -I- /fast-^fast/-^siow 



COSMOMC automatically thins Metropolis runs by a fac- 
tor /fast so that large /fast can be used without making 
output files very large and avoiding possible additional 
disk access overheads. 



A. Quantifying convergence 

There are many ways to test and quantify chain con- 
vergence, all of which are necessary but not sufficient to 
guarantee correct answers. Here I quantify chain conver- 
gence using a single simple generalized Gelman-Rubin 
statistic [3 [5], R— 1- This measures the variance in pa- 
rameter means evaluated from different chains in units 
of the parameter variance, which should be a small num- 
ber for well-converged chains so that the posterior means 
are accurately measured. In detail, using all of the post 
burn-in samples from n chains, where each sample gives 
a vector of parameter values x, this is evaluated with the 
following steps: 

• Calculate the mean x^ from each chain, and the 
total mean from all chains x. 

• Calculate the covariance between n chains of the 
chain means 

1 " 

Cx = V'lXi - x)(Xi - x)'^. 

n — 1 '^ — ' 

i=l 

This should be small compared to the parameter 
covariance if the chains are all well converged. 

• Calculate the mean of the covariances within each 
chain 

1 " 

Cx = ^^ V'wi((x-Xi)(x-Xj)'^) 

where angle brackets denote the weighted mean 
within the chain and Wi is the sum of the weights 
in each chain. 

• Cholesky decompose the mean covariance Cx — 
LL^ to orthonormalizc the parameters by forming 
L-ix. 

• Calculate the eigenvalues Di of the between- 
chain covariance of the orthonormalized parameter 
means using i-^Cxii"^]'^ = UDU^. 

• Define R—\ = max(Di) to measure the largest vari- 
ance between chains of any of the orthonormalized 
parameter means. 

The R — 1 statistic can be evaluated for all parame- 
ters or just a subset of particular interest. Specifically 
I shall compare the values obtained from just the slow 
parameters with the value for all of the (fast-|-slow) pa- 
rameters. Convergence in the slow subspace can be sig- 
nificantly better than in the full space if there are only 
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FIG. 3: Indicative correlation length in units of slow param- 
eter steps for Planck parameter estimation runs in the base- 
line six-parameter model with known covariance. Left shows 
Planck+WP (iVfast = 13), right shows P/ancA; -I- WP+highL 
(A'^fast = 31). Solid lines show the correlation length for one 
of the slow parameters, dashed lines show one of the fast 
parameters. Using the dragging method with /drag > 2 or 
Metropolis with /fast ^ 4 reduces the fast-parameter corre- 
lation length to be similar or smaller than the slow correla- 
tion length, improving convergence compared to the simplest 
fast-slow scheme (Metropolis with /fast = 1). Actual total 
efficiency depends on the relative speed of the fast and slow 
steps. 



weak fast-slow dependencies and the fast space is hard 
to explore (and vice versa). Requiring convergence on 
the full space is always more conservative, and typically 
chains are run to obtain R — 1 < 0.1. Smaller values 
may be required if dense sampling is required for esti- 
mating two and three-dimensional densities, robust eval- 
uation of confidence intervals, or later importance sam- 
pling. Often aiming for _R— 1 ^ 0.02 works well for many 
purposes. If confidence intervals are the primary con- 
cern the Gelman-Rubin statistic can also be generalized 
to test convergence of the between-chain variance of the 
confidence intervals (rather than the default measure of 
the covariance of the means) . 



B. Cosmological parameters with Planck 

The Planck satellite provides high-resolution maps of 
the microwave background and corresponding likelihood 
functions. In addition to the theoretical power spectrum 
(a function of spherical wavenumber Z), the likelihood 
also depends on a number of calibration, beam mode, 
foreground amplitude and foreground correlation 'nui- 
sance' parameters as described in detail in Ref. ^. The 
latter parameters are important to model the small-scale 
data reliably where foreground contamination and beam 
errors are non-negligible. However in this regime the like- 
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0.018 


0.011 


Metropolis /fast = 16 


0.011 


0.007 



Planck+WP+highL (A^siow = 6, A^fast = 31) 


Method 


All {R - 1) 


Slow {R - 1) 


No fast-slow 


0.25 


0.074 


Dragging /drag = 1 
Dragging /drag = 2 
Dragging /drag = 3 
Dragging /drag = 5 


0.031 
0.020 
0.023 
0.027 


0.010 
0.009 
0.008 
0.027 


Metropolis /fast = 1 
Metropolis /fast = 2 
Metropolis /fast = 4 
Metropolis /fast = 8 
Metropolis /fast = 16 


0.042 
0.015 
0.014 
0.012 
0.011 


0.004 
0.003 
0.008 
0.008 
0.008 



TABLE I: Typical convergence values achieved using various fast-slow sampling methods after fixed wall time for the case 
study of six baseline cosmological parameters from Planck data, including also WMAP polarization (WP). The left table is 
for a five-hour run with a relatively unconstrained and highly degenerate foreground (fast nuisance parameter) model with 
Nsiow = 6, A^fast = 13. The right table shows results from an eight-hour run also using additional data (highL) that constrain 
the foreground model better, making it less degenerate and more Gaussian, but adding 18 additional fast nuisance parameters. 
The columns give Gelman-Rubin R ~ 1 values evaluated for the least-converged direction in either the full parameter space 
(All) or the slow parameter subspace (Slow - which are the parameters of most physical interest in this case). The precision 
quoted is higher than sampling fluctuations, so exact numbers should not be over-interpreted, but the great improvement 
over the non-fast-slow method is clear, and also the advantage of using /fast > 1 for Metropolis sampling. In all cases an 
accurate precomputed fixed covariance matrix was used to orthogonalize the parameters. The fast and slow parameters are 
well decorrelated, so the overhead of the dragging method makes it less efficient than /fast 2> 1 with Metropolis sampling in 
this case. 



lihood is also well described for most theoretical models 
by a fiducial Gaussian approximation, where one likeli- 
hood evaluation only requires a fast matrix vector multi- 
plication using a precomputed inverse covariance matrix. 
Since the foreground models are all very simple power 
laws or template amplitudes, when any of the nuisance 
parameters are changed only the high-Z part of the likeli- 
hood needs to be recomputed, not the theoretical power 
spectra nor the low-^ likelihood. The nuisance parame- 
ters are therefore fast parameters compared to the slow 
parameters determining the cosmological model which re- 
quire a new calculation of the theoretical power spectra 
using CAMB (and also the low-/ polarization likelihood). 
For this test case the speed hierarchy is O(IOO), so that 
0(100) nuisance parameter changes can be made for the 
same computational cost as one change in the slow cos- 
mological parameters. 

Here I consider just two simple cases to give indicative 
performance. I follow the parameter estimation assump- 
tions described in Ref . [^ , and use the public Planck like- 
lihood code^ in combination with COSMOMC (which uses 
CAMB for the theoretical calculation [TOj). The two cases 
have differing properties and numbers of fast parame- 
ters: "Planck+WP" is where the likelihood only includes 
Planck data and polarization from WMAP (WP; |11|): 
" Planck +WP+highU^ is an extended data combination 
which includes data from other high-Z CMB observations 



(highL; [Hllig). The Planck likelihood depends on 13 
nuisance parameters, the highL likelihood on 24, but 6 of 
these are determining physical foregrounds that are com- 
mon to the Planck likelihood. There are therefore a total 
of 13 and 31 different fast parameters in the two cases, 
with one fast block in the Planck+WP case and two fast 
blocks (taken to be of 13 and 18 parameters each) in 
the Planck+highL case. The WMAP polarization likeli- 
hood has no nuisance parameters and does not need to 
be recomputed unless the slow cosmological parameters 
change. 

The underlying cosmological model is usually de- 
scribed by six or more slow cosmological parameters'*. 
For the theoretical calculation there are two blocks of 
slow parameters: general cosmological parameters that 
require full reevaluation of the linear transfer functions, 
and semi-slow parameters than determine the initial con- 
ditions as parameterized by the initial power spectrum. 
In itself there is a significant speed difference between 
these blocks, but both sets of parameters also require the 
\ow-l likelihood to be reevaluated using the theoretical 
prediction, and this step can have a non-negligible com- 
putational cost that makes the total hierarchy in speed 
rather more modest. This speed hierarchy can still be ex- 
ploited using two blocks of Cholesky decorrelated param- 
eters, but it is not large enough for a dragging method 



http : //www . sclops . esa . int/wikiSI/planckpla/ 



* For speed I will use zero neutrino mass here, which slightly 
favours fewer fast steps compared to a realistic calculation. 



to be efRcient. 

The convergence for given wall time depends on the 
method used and the relative speed of the fast and slow 
calculations. There is a trade-off between having more 
fast steps — which improves movement in the slow pa- 
rameter space for the dragging scheme, and in general 
speeds convergence of the fast subspace — and slowing 
things down by having so many fast steps that their com- 
putational cost becomes significant. In all test cases I ran 
four chains (on one node, with four cores per chain, 16 
cores in total) with a variety of sampling methods, and 
used a proposal distribution scale of 2 (see Appendix IaI). 



The best fast-slow parameter choices seems to perform 
well, with convergence achieved for fixed number of slow 
likelihood evaluations being similar to that expected for 
sampling with no nuisance parameters. The speed hit 
from the nuisance parameters is therefore limited to the 
numerical cost of calculating the fast steps, which for 
good parameter choices does not dominate the numerical 
cost of the slow step; the overall efficiency is then within 
0(1) of the performance expected for no fast nuisance 
parameters. 



2. New parameters with unknown covariance 



1. Baseline case with known covariance 

First I use an accurate precomputed covariance ma- 
trix for the parameter decorrelation, which will tend to 
favour standard Metropolis over dragging schemes, and 
sample the six parameters of the baseline ACDM cosmol- 
ogy. The R—1 convergence statistics achieved after fixed 
wall time are summarized in Table ll] for various sampling 
configurations. Corresponding chain correlation lengths 
are shown in Fig. [3] as a function of the number of slow 
steps along the chain. 

All the fast-slow methods considered significantly out- 
perform the simple non-fast-slow Metropolis method, 
with speedup 0(1-1- Mast/A^siow)- The fast parameter 
space is actually less Gaussian than the slow space (e.g. 
due to various hard parameter priors), and hence harder 
to explore. The fast-slow method speed-up can actu- 
ally be better than 1 -I- A^fast /A'siow since as shown in 
Fig.|3]it is the fast parameters that determine the overall 
correlation length; the fast-slow sampling schemes (for 
/fast ^ 1) can reduce this to being less than the slow 
parameter correlation lengths and hence give additional 
gains. This is mitigated by the non-negligible cost of the 
fast steps, so that eventually performance becomes worse 
as vastly more fast calculations are required per slow step 
(the dragging method with /drag = 5 is clearly worse in 
the 37-parameter case than /drag = 3). 

For the fast-slow Metropolis schemes the convergence 
in the full space is significantly improved by having 
/fast ^ 1- Since the fast and slow parameters are decor- 
related there is actually a trade-off between better con- 
vergence in the fast subspace and the slow subspace: for 
/fast ^ 4 convergence in the slow parameter space be- 
comes worse due to the lower total number of slow steps 
that can be performed in the given computation time, but 
larger /fast continues to improve fast subspace conver- 
gence. The dragging methods achieve good convergence 
in the full and slow subspaces for /drag ^ 2, though they 
are outperformed in this case by the simpler Metropolis 
scheme since there is only a modest gain in slow sub- 
space movement once the parameters are fully decorre- 
lated. Convergence in the fast subspace could also be im- 
proved by using a proposal distribution more tailored to 
the non-Gaussian shape of the fast-parameter subspace. 



In reality the covariance is also often unknown a pri- 
ori, and a common situation is testing a new model with 
new parameters (with unknown correlations) in addition 
to the six parameters of the baseline ACDM model, or 
changing the number and modelling of the nuisance pa- 
rameters. In these cases it is usually highly beneficial 
to do an initial run to estimate the covariance, or use 
an adaptive scheme that gradually learns the covariance, 
with decorrelation of the parameters in subsequent steps. 
COSMOMC uses an adaptive scheme that uses a growing 
fraction of the previous chain samples to estimate the 
covariance used for the proposal distribution. 

The adaptive method is asymptotically valid as long as 
the number of samples used to estimate the correlation 
grows as a fixed fraction of the total samples so the covari- 
ance itself converges, and hence the subsequent updates 
are Markovian. While the covariance estimate is chang- 
ing significantly the process is strictly non-Markovian, 
and these early fraction of steps can be removed as ex- 
tended burn in. CoSMOMCs algorithm is essentially a 
version of the adaptive Metropolis method [T4; , with the 
covariance being updated periodically from the average of 
multiple chains for speed of MPI implementation. Since 
the proposal is only updated periodically the method is 
also piecewise Markovian in the limit that the exploration 
time is short compared to the update time. For dis- 
cussion of ergodicity of adaptive Metropolis methods see 
Refs. [m [15] and references therein. 

As a simple test case I'll consider a simple approxi- 
mation to adding a new set of fast parameters with un- 
known covariance. This is a common situation: for ex- 
ample, before the Planck nuisance-parameter model was 
defined the expected cosmological parameter covariance 
was known from forecasts, but the covariance with the 
new foreground model was unknown. To avoid wasting 
computer time on simple tests, I approximate the full 
Planck likelihood as Gaussian, with covariance given by 
that from the actual full run, and then do test runs start- 
ing with a diagonal fast-parameter covariance set to some 
initial guess at appropriate widths. 

First consider the case with non-adaptive sampling, 
so the proposal distribution is fixed from the beginning 
(and fast and slow parameters are not decorrelated be- 
cause the full covariance is unknown). In this case the 
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dragging method can perform significantly better, though 
here the fast and slow parameters are sufficiently weakly 
correlated that dragging performs similarly to Metropolis 
sampling with high /fast and there is no clear advantage. 
In extended models with stronger correlations between 
fast and slow parameters the dragging scheme is expected 
to perform better. 

However the adaptive scheme that learns the covari- 
ance and in subsequent steps uses it to define the pro- 
posal distributions and fast-slow decorrelation is around 
twice as efficient for P/ancfc+WP, and around three times 
as efficient for P/an,cfc-|-WP+highL, so there is a clear ad- 
vantage to using an adaptive method. This gain would 
be even larger if new correlated slow parameters were 
also being added. The adaptive proposal learning scheme 
is straightforwardly combined with any of the fast-slow 
sampling methods, and is recommended unless a good 
covariance is known a priori. 



C. Likelihood requirements 

As we have seen, significant performance gains are pos- 
sible using fast-slow sampling methods. For this to work 
it is essential that all parts of the likelihood calculation 
that have different speed and parameter dependence are 
separated. In some cases this may require a significant 
amount of additional thought into how to structure the 
calculation. If the computational cost of changing the 
fast parameters is small but non-negligible, there will be 
additional gains to be had by optimizing the fast likeli- 
hood, even though in a non-fast-slow method it would 
contribute insignificantly to the total numerical cost. 



V. CONCLUSIONS 

Using efficient fast-slow sampling methods can sig- 
nificantly improve the performance of parameter infer- 
ence when there are large numbers of fast parameters. 
I've demonstrated that a simple speed-ordered Cholesky 
orthogonalization can provide substantial performance 
gains for problems currently of interest in cosmology. 
This can be combined with an adaptive covariance learn- 
ing scheme and/or the fast-parameter dragging method 
to improve robustness and performance when there are 
non-trivial correlations or dependencies between the fast 
and slow parameters. Although I have focussed on ap- 
plication in cosmology, the algorithms are general, and a 
wide class (though certainly not all) problems are likely 
to have a similar speed structure. 

I have only considered here the case of parameter esti- 
mation, but similar efRciency gains should also be achiev- 
able for evidence (free energy) calculations. Algorithms 
such as thermodynamic integration generalize straight- 
forwardly to exploit fast and slow parameters, other al- 
gorithms however may require more work to adapt and 
this should be the subject of further investigation. A va- 



riety of other general fast-slow sampling methods have 
also been proposed by Ref. J16j . which may give perfor- 
mance gains in some problems. 

The focus of this paper in on direct fast-slow sampling 
schemes for generating samples from the full posterior. 
However, in particular cases, there may of course be bet- 
ter alternatives. For example it may be possible to accu- 
rately approximate the slow part of the likelihood. If the 
approximation can be made accurate enough (without 
an over-expensive precomputation step), then full chains 
can simply be run directly. For example in the cosmolog- 
ical context the PICO approximation [17] can be used 
to calculate the CMB power spectra quite accurately, 
which can save a lot of time if the goal is to explore many 
different fast nuisance parameter models with the same 
underlying set of cosmological models. In other cases 
the approximation may not be accurate enough, but still 
a sufficiently good approximation that later importance 
sampling correction from the exact likelihood works well 
(see e.g. ^). This can be a good solution if the full likeli- 
hood is too slow for a fast-slow direct sampling method to 
be useable, and allows rapid full exploration in the fast- 
approximated likelihood space; only the highly-thinned 
independent samples then need to be corrected by later 
importance sampling, a step that is trivially paralleliz- 
able. In cases where importance sampling cannot be used 
efficiently there are also sampling schemes that can ex- 
ploit fast approximations, see for example Ref. ,18J and 
references therein. 

The numerical Fortan 2003 code COSMOMC discussed 
here is publicly available^, which implements both stan- 
dard Metropolis and dragging sampling methods, along 
with the GetDist program for analysing samples and 
generating marginalized 1, 2 and 3-D posteriors, and 
python scripts for managing and running grids of runs, 
processing the results, and making latex tables and 
plots (as used in the main Planck parameter analysis 
of Ref. 0). 
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FIG. 4; Possible radial proposal distributions. An n-D Gaus- 
sian proposal distribution corresponds to choosing a random 
direction in parameter space and proposing a move by dis- 
tance r in that direction with probability Pnir); larger n be- 
come more sharply peaked near one, and in particular very 
rarely propose much smaller or larger moves. The thick black 
line P„f{r) shows a mixture with fraction / = | of P2{r) and 
fraction 1 — / of an exponential distribution. This has much 
broader tails and does not go to zero at r ~ 0, and is the 
proposal distribution used by COSMOMC. It is much more 
robust to covariance matrix misestimation than a high n-D 
Gaussian proposal distribution (though slightly less optimal 
in the ideal case). 



Appendix A: Choice of proposal distribution 

Metropolis sampling methods work with any symmet- 
ric proposal distribution in principle, as long as it allows 
eventual exploration of the full parameter space. Gaus- 
sian proposal distributions are often used, but these are 
not necessarily optimal. 

As discussed in the text, it usually pays to transform 
to orthogonalized parameters for efficient parameter ex- 
ploration. The fast-slow Cholesky parameter redefinition 
achieves this, but it remains to make a choice of proposal 
distribution in the orthonormalized parameter space. 

First consider the case of sampling an orthonormalized 
parameter space using a Metropolis method with an A^- 
dimensional Gaussian proposal distribution. Sampling 
from an A'^-D Gaussian proposal distribution amounts to 
choosing a direction at random (a direction on the surface 
of an A^ — 1 sphere), then making a proposal -PAr(r) for 
the distance r along that direction with 



Pn{r) ex r" 



72 



(Al) 



The distance may be scaled by a factor s, and Ref. [191 120] 
show that s ~ 2.4 is optimal in terms of minimizing the 
correlation length when the target distribution is Gaus- 
sian. I define P^(r) ex Pn{r/s) so that the optimal radial 



proposal distribution for the choice of an A^-dimensional 
Gaussian in P'^^{r). 

One simple improvement on this is to cycle directions 
rather than choosing a random direction for each move. 
There are N orthonormal eigendirections that make a 
basis for the space, but proposals along these directions 
can be made in any order. One good option is to first 
choose a random basis rotation, then cycle through the 
basis, making proposals along each basis vector direction 
in turn. When all the basis vector directions have been 
proposed, generate a new random basis and repeat. This 
cycling avoids random directions sometimes heading back 
where they have just come from, and helps to remove 
random noise from the number of proposals in different 
directions. This can improve exploration especially in 
relatively low dimensions. 

The choice of an TV-dimensional proposal distribution 
may also be suboptimal. Consider separating the choice 
of direction to move in and distance to move. For exam- 
ple we could try some radial proposal function Pn{f) for 
n ^ N, which would correspond (for n < N) to making 
a Gaussian proposal in an n-D subspace. A good and 
robust proposal distribution performs well in the ideal 
case, but also doesn't perform too badly if the proposal 
width is not quite correct (for example if it has been es- 
timated from a fairly short initial sampling period). In 
this respect P|'*(r) is significantly superior to P'^'^{r) 
when sampling number of dimensions N > 2: the dis- 
tribution is significantly less peaked, and therefore much 
less sensitive to the width being chosen incorrectly. See 
Fig.i 

Consider for example the case of a factor four overes- 
timation of the proposal width: sampling a 7D Gaussian 
with P2'^{r) the autocorrelation at 50 steps is C50 ~ 0.45, 
whereas for P^-^{r) it is much worse at C50 ^ 0.9. For 
factor 4 underestimation Pj-*{r) performs only slightly 
better. For factor 8 overestimation C50 ^ 0.8 when using 
P2''^{'''), which is slow but not completely useless, how- 
ever Pj'^{r) performs catastrophically badly because it 
almost never proposes small moves. For the ideal Gaus- 
sian case when the proposal width is matched, P^'^{r) 
gives Cio ~ 0.3 as opposed to P2'^{r) which gives the 
slightly less optimal correlation length Cio ~ 0.42. This 
cost may however be worth paying for more robustness. 
Also note that decorrelation is not necessarily a good 
indicator of efficient exploration of the full parameter 
space; for example a wide fixed-width proposal can of- 
ten flip between tails on opposite sides of the mean and 
give rapid decorrelation, but leave the central region and 
more extreme tails poorly explored (and hence poor over- 
all convergence). 

Underestimation of the proposal width can also be 
problematic and lead to slow random walk exploration. 
Using a proposal distribution with thicker tails helps 
with this. But underestimation is generally less prob- 
lematic than overestimation because adaptive proposal 
updates (see Sec. IVB 2 ) will gradually increase the pro- 



posal width to something more appropriate (whereas a 
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large overestimation can lead to no chain movement in 
some direction, and hence no useful estimated covariance 
eigenvalue). 

Having a broader proposal distribution is also likely 
to be advantageous when the target is non-Gaussian, for 
example containing broad tails or nearly-isolated local 
maxima. The sampling can be made even more robust 
to wrong proposal width estimation by increasing the 
probability for small and large distance proposals. One 
simple way to do this is to mix in some component of an 
exponential distribution: 

P„/(r)-/P„(r) + (l-/)e-'-. (A2) 

Taking / ^ 2/3, n = 2 seems to work well, with perfor- 



mance not much effected by the choice; this is the default 
choice in the COSMOMC code. By design the efficiency 
is not very sensitive to the proposal width, with a scaling 
of 1.5-2.5 generally working well and giving acceptance 
probabilities in the range 0.2-0.5. The distribution shape 
is rather similar to using Gaussian proposals in random 
1 to 3D subspaces, but with somewhat broader tails at 
large r. For specific idealized cases this proposal dis- 
tribution may be slightly suboptimal, but it is usually 
much more important for typical usage to perform well 
in most cases than to perform optimally in very specific 
test cases. 



[1] R. M. Neal (2005), "math.ST/0502099' 

[2] A. Lewis and S. Bridle, Phys. Rev. D66, 103511 (2002), 

astro-ph/0205436 
[3] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, 

A. H. Teller, and E. Teller, Journal of Chemical Physics 

21, 1087 (1953). 
[4] W. Hastings, Biometrika 57, 97 (1970). 
[5] A. Kosowsky, M. Milosavljevic, and R. Jimenez, Phys. 

Rev. D66, 063007 (2002), astro-ph/020 6014_ 
[6] R. Neal, Statistics and Computing 6, 353 (1996), 

http://www. cs .toronto. edu/~radf ord/ttrans ■ | 



I abstract . htmll 
[7] A. Gelman and D. Rubin, Statistical Science 7, 
457 (1992), http://www.stat.coluinbia.edu/~gelman/ 
research/published/itsim.pdf 
[8] S. P. Brooks and A. Gelman, Journal of Compu- 
tational and Graphical Statistics 7, 434 (1998), 
ihttp : //www . Stat . Columbia . e du/~gelman/research/ | 
published/brooksgelman2 . pdfl 
"[9pp. Ade et al. (Planck Collaboration) (2013), ■1303.5076' 
[10] A. Lewis, A. Ch allinor, and A. Las enby, Astrophys. J. 
538, 473 (2000), |astro-ph/99m77| 



e: 



[11] C. Bennett, D. Larson, J. Weiland, N. Jarosik, G. Hin- 

shaw, et al. (2012), 1212.5225 
[12] C. Reichardt, L. Shaw, O. Zahn, K. Aird, B. Benson, 

et al., Astrophys.J. 755, 70 (2012), 1111.0932 
[13] S. Das, T. Louis, M. R. Nolta, G. E. Addison, E. S. 

Battistelli, et al. (2013), |1301.1037| 
[14] E. S. H. Haario and J. Tamminen, Bernoulli 7, 

223 (2001), |http : //pro j ecteuclid . org/euclid . bj7 
I 108 0222083 

[15] M. Vihola, Electronic Journal of Probability 16, 45 

(2011), 0911.0522, 

[16] R. M. Neal (20117, |1101.0387 | 

[17] W. A. Fendt and B. DrWandelt, Astrophys.J. (2007), 

10712.01 94 
[18] C. Wang and R. M. Neal (2013), 1305.2235 
[19] A. Gelman, G. Roberts, and W. Gilks, in Bayesian 

Statistics, edited by J. M. Bernado et al. (OUP, 1996), 

vol. 5, p. 599. 
[20] J. Dunkley, M. Bucher, P. G. Ferreira, K. Moodley, and 

C. Skordis, Mon. Not. Roy Astron. Soc. 356, 925 (2005), 

astro-ph/0405462. 



